33 research outputs found
Estimating the maximum expected value in continuous reinforcement learning problems
This paper is about the estimation of the maximum expected value of an infinite set of random variables. This estimation problem is relevant in many fields, such as Reinforcement Learning (RL). In RL, it is well known that, in some stochastic environments, a bias in the estimation can increase the approximation error step by step, leading to large overestimates of the true action values. Recently, some approaches have been proposed to reduce this bias and obtain better action-value estimates, but they are limited to finite problems. In this paper, we build on the recently proposed weighted estimator and on Gaussian process regression to derive a new method that natively handles infinitely many random variables. We show how these techniques can be used to address RL problems with both continuous states and continuous actions. To evaluate the effectiveness of the proposed approach, we perform empirical comparisons with related approaches.
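As a rough illustration of the idea (a sketch, not the authors' code), the following fits a Gaussian process to noisy action-value samples over a continuous action space and estimates the maximum expected value as a weighted sum of posterior means, where each candidate action is weighted by the probability, estimated from posterior samples, that it is the maximizer; the kernel, noise level, and sample counts are illustrative assumptions.

    # Weighted estimate of the maximum expected value over a continuous action
    # space via Gaussian process regression (illustrative sketch).
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(0)

    # Noisy observations of an unknown action-value function on [0, 1].
    actions = rng.uniform(0.0, 1.0, size=(30, 1))
    values = np.sin(3.0 * actions[:, 0]) + 0.3 * rng.standard_normal(30)

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=0.3 ** 2)
    gp.fit(actions, values)

    # Candidate actions at which the maximum could be attained.
    candidates = np.linspace(0.0, 1.0, 200).reshape(-1, 1)

    # Sample functions from the GP posterior and count how often each candidate
    # is the maximizer: these frequencies are the weights of the estimator.
    samples = gp.sample_y(candidates, n_samples=1000, random_state=0)
    counts = np.bincount(samples.argmax(axis=0), minlength=len(candidates))
    weights = counts / counts.sum()

    # Weighted estimator: posterior means weighted by the probability of each
    # candidate being the maximizer, instead of taking a plain maximum.
    max_estimate = float(weights @ gp.predict(candidates))
    print(max_estimate)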
Deep Reinforcement Learning with Weighted Q-Learning
Overestimation of the maximum action-value is a well-known problem that
hinders Q-Learning performance, leading to suboptimal policies and unstable
learning. Among several Q-Learning variants proposed to address this issue,
Weighted Q-Learning (WQL) effectively reduces the bias and shows remarkable
results in stochastic environments. WQL uses a weighted sum of the estimated
action-values, where the weights correspond to the probability of each
action-value being the maximum; however, the computation of these probabilities
is only practical in the tabular setting. In this work, we provide the
methodological advances to benefit from the WQL properties in Deep
Reinforcement Learning (DRL), by using neural networks with Dropout Variational
Inference as an effective approximation of deep Gaussian processes. In
particular, we adopt the Concrete Dropout variant to obtain calibrated
estimates of epistemic uncertainty in DRL. We show that model uncertainty in
DRL can be useful not only for action selection, but also for action evaluation. We
analyze how the novel Weighted Deep Q-Learning algorithm reduces the bias
w.r.t. relevant baselines and provide empirical evidence of its advantages on
several representative benchmarks.
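The following sketch (an illustration, not the paper's implementation) shows how a WQL-style target can be formed from dropout samples of a Q-network: stochastic forward passes act as approximate posterior samples, each action is weighted by its empirical probability of being the argmax, and the target uses the weighted sum of mean action-values rather than the max. The network size, dropout rate, and use of standard rather than Concrete Dropout are assumptions made for brevity.

    # WQL-style target from dropout samples of a Q-network (illustrative sketch).
    import torch
    import torch.nn as nn

    n_actions, n_samples = 4, 50
    q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Dropout(p=0.1),
                          nn.Linear(64, n_actions))
    q_net.train()  # keep dropout active so each forward pass is a stochastic sample

    next_state = torch.randn(1, 8)
    reward, discount = 1.0, 0.99

    with torch.no_grad():
        # Stochastic forward passes approximate samples of the action-values.
        samples = torch.stack([q_net(next_state).squeeze(0) for _ in range(n_samples)])

        # Weight of each action: empirical probability of being the maximizer.
        counts = torch.bincount(samples.argmax(dim=1), minlength=n_actions).float()
        weights = counts / counts.sum()

        # Weighted target: a weighted sum of mean action-values instead of the max,
        # which reduces the overestimation bias of the standard Q-Learning target.
        target = reward + discount * torch.dot(weights, samples.mean(dim=0))
    print(target.item())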
On the Benefit of Optimal Transport for Curriculum Reinforcement Learning
Curriculum reinforcement learning (CRL) allows solving complex tasks by
generating a tailored sequence of learning tasks, starting from easy ones and
subsequently increasing their difficulty. Although the potential of curricula
in RL has been clearly shown in various works, it is less clear how to generate
them for a given learning environment, resulting in various methods aiming to
automate this task. In this work, we focus on framing curricula as
interpolations between task distributions, which has previously been shown to
be a viable approach to CRL. Identifying key issues of existing methods, we
frame the generation of a curriculum as a constrained optimal transport problem
between task distributions. Benchmarks show that this way of curriculum
generation can improve upon existing CRL methods, yielding high performance in
various tasks with different characteristics.
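As a toy illustration of curricula framed as interpolations between task distributions, the sketch below builds a curriculum along the Wasserstein-2 geodesic between an easy Gaussian task distribution and the target one; for Gaussians this geodesic simply interpolates means and standard deviations linearly. The task parameterization is assumed, and the constraints of the paper's constrained optimal transport formulation are omitted.

    # Curriculum as Wasserstein-2 interpolation between Gaussian task
    # distributions (illustrative sketch).
    import numpy as np

    mu0, sigma0 = 0.5, 0.1  # easy tasks, e.g. small goal distance (assumed parameterization)
    mu1, sigma1 = 5.0, 1.0  # target tasks, e.g. large goal distance

    def curriculum(n_stages):
        """Gaussian task distributions along the W2 geodesic from easy to target."""
        for t in np.linspace(0.0, 1.0, n_stages):
            yield (1 - t) * mu0 + t * mu1, (1 - t) * sigma0 + t * sigma1

    for stage, (mu, sigma) in enumerate(curriculum(5)):
        tasks = np.random.default_rng(stage).normal(mu, sigma, size=3)
        print(f"stage {stage}: sample tasks {np.round(tasks, 2)}")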
MushroomRL: Simplifying Reinforcement Learning Research
MushroomRL is an open-source Python library developed to simplify the process
of implementing and running Reinforcement Learning (RL) experiments. Compared
to other available libraries, MushroomRL has been created with the purpose of
providing a comprehensive and flexible framework to minimize the effort in
implementing and testing novel RL methodologies. Indeed, the architecture of
MushroomRL is built in such a way that every component of an RL problem is
already provided, so that most of the time users can focus solely on the
implementation of their own algorithms and experiments. The result is a library
from which RL researchers can significantly benefit in the critical phase of
the empirical analysis of their works. The stable code, tutorials, and
documentation of MushroomRL can be found at https://github.com/MushroomRL/mushroom-rl.
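A minimal usage sketch of the kind of experiment the library is designed to simplify, based on its documented tabular Q-Learning workflow; module paths and constructor signatures may differ across library versions.

    # Tabular Q-Learning on a small grid world with MushroomRL (usage sketch;
    # exact module paths and signatures depend on the installed version).
    from mushroom_rl.algorithms.value import QLearning
    from mushroom_rl.core import Core
    from mushroom_rl.environments import GridWorld
    from mushroom_rl.policy import EpsGreedy
    from mushroom_rl.utils.parameters import Parameter

    mdp = GridWorld(width=3, height=3, goal=(2, 2), start=(0, 0))
    policy = EpsGreedy(epsilon=Parameter(value=0.1))
    agent = QLearning(mdp.info, policy, learning_rate=Parameter(value=0.2))

    core = Core(agent, mdp)  # the Core handles the agent-environment interaction loop
    core.learn(n_steps=10000, n_steps_per_fit=1)
    dataset = core.evaluate(n_episodes=10)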
A Probabilistic Interpretation of Self-Paced Learning with Applications to Reinforcement Learning
Across machine learning, the use of curricula has shown strong empirical
potential to improve learning from data by avoiding local optima of training
objectives. For reinforcement learning (RL), curricula are especially
interesting, as the underlying optimization has a strong tendency to get stuck
in local optima due to the exploration-exploitation trade-off. Recently, a
number of approaches for an automatic generation of curricula for RL have been
shown to increase performance while requiring less expert knowledge compared to
manually designed curricula. However, these approaches are seldom
investigated from a theoretical perspective, preventing a deeper understanding
of their mechanics. In this paper, we present an approach for automated
curriculum generation in RL with a clear theoretical underpinning. More
precisely, we formalize the well-known self-paced learning paradigm as inducing
a distribution over training tasks, which trades off between task complexity
and the objective to match a desired task distribution. Experiments show that
training on this induced distribution helps to avoid poor local optima across
RL algorithms in different tasks with uninformative rewards and challenging
exploration requirements.
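A toy sketch of the trade-off described above: a distribution over a discrete set of training tasks that favors tasks the current agent performs well on, while being pulled toward a desired target task distribution. The exponential-family form and the temperature parameter eta are illustrative assumptions, not the paper's exact objective.

    # Self-paced-style task distribution trading off current performance against
    # a target task distribution (illustrative sketch).
    import numpy as np

    target = np.array([0.1, 0.2, 0.3, 0.4])   # desired target task distribution mu
    returns = np.array([9.0, 6.0, 2.0, 0.5])  # current expected return J per task

    def self_paced_distribution(eta):
        """p(task) proportional to mu(task) * exp(J(task) / eta)."""
        logits = np.log(target) + returns / eta
        p = np.exp(logits - logits.max())
        return p / p.sum()

    # Large eta: performance barely matters, p stays close to the target distribution.
    # Small eta: p concentrates on tasks the agent already solves (easy tasks first).
    for eta in (100.0, 5.0, 1.0):
        print(eta, np.round(self_paced_distribution(eta), 3))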
Monte-Carlo tree search with uncertainty propagation via optimal transport
This paper introduces a novel backup strategy for Monte-Carlo Tree Search
(MCTS) designed for highly stochastic and partially observable Markov decision
processes. We adopt a probabilistic approach, modeling both value and
action-value nodes as Gaussian distributions. We introduce a novel backup
operator that computes value nodes as the Wasserstein barycenter of their
action-value children nodes; thus, propagating the uncertainty of the estimate
across the tree to the root node. We study our novel backup operator when using
a novel combination of the L1-Wasserstein barycenter with the α-divergence,
by drawing a notable connection to the generalized mean backup operator. We
complement our probabilistic backup operator with two sampling strategies,
based on optimistic selection and Thompson sampling, obtaining our Wasserstein
MCTS algorithm. We provide theoretical guarantees of asymptotic convergence to
the optimal policy, and an empirical evaluation on several stochastic and
partially observable environments, where our approach outperforms well-known
related baselines.
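A minimal sketch of the probabilistic backup idea, assuming one-dimensional Gaussian value distributions and the Wasserstein-2 barycenter, for which the Gaussian case has a simple closed form (weighted averages of means and of standard deviations); the visit-count weights are an illustrative choice rather than the paper's exact operator.

    # Value-node backup as a Wasserstein barycenter of Gaussian action-value
    # children (illustrative sketch, 1D Gaussian / W2 case).
    import numpy as np

    # Action-value children of a node: (mean, std, visit count).
    children = [(1.0, 0.5, 10), (2.5, 1.5, 3), (0.2, 0.1, 1)]

    means = np.array([c[0] for c in children])
    stds = np.array([c[1] for c in children])
    visits = np.array([c[2] for c in children], dtype=float)
    weights = visits / visits.sum()

    # The parent value node is the barycenter of its children, so the children's
    # uncertainty (std) propagates up the tree instead of being discarded by a
    # plain max backup.
    value_mean = float(weights @ means)
    value_std = float(weights @ stds)
    print(value_mean, value_std)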